When we talk about clustering, we usually invoke the concept of 'unsupervised learning', which in turn brings up its counterpart, 'supervised learning'.

Supervised machine learning applies when there is a defined relationship between the independent variables X and the dependent variable Y. If the target is numerical, the task is regression; if it is a class variable, the task is classification.

However, data is never straightforward, and there are often instances in which no such relationship between the variables is defined. In those instances, employing an unsupervised learning method is the best choice. Clustering is an unsupervised learning technique that works on unlabelled data and discovers clusters, or groups, of similar data points. This helps in understanding the internal structure of the data and the patterns within the dataset. In the economic dataset used here, which contains fiscal and monetary data for the BRICS nations, clustering could be useful for understanding whether there are well-defined indicators that highlight the performance of each nation, say in terms of GDP. For instance, if a set of instances cluster together, it means some features are well defined for those specific instances, and that information could be explored to ethically understand the contribution of each factor.
K-Means Clustering
Clustering techniques broadly fall into partition-based methods (such as k-means), density-based methods (such as DBSCAN), and hierarchical methods (such as agglomerative clustering); each of these is explored in this section. K-means is a partition-based algorithm that divides the data into k clusters by assigning every point to its nearest cluster centroid and then updating each centroid to the mean of its assigned points, repeating until the assignments stabilise (MacQueen, 1967).
# to ignore future warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn.cluster._kmeans")

# optimal clustering using the elbow method
wcss = []
for i in range(1, 8):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

# Plotting the WCSS values
plt.plot(range(1, 8), wcss, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
In the code above, the warnings module suppresses the FutureWarnings that scikit-learn's KMeans emits. For k-means clustering, finding the optimal number of clusters matters for interpretability, performance, and informed decision making. The elbow method finds the optimal number of clusters k using the WCSS (Within-Cluster Sum of Squares), which is calculated from the distances between each cluster's centroid and the points assigned to it. A loop computes the WCSS for each value of k between 1 and 7 and plots the result, which resembles an elbow: as the number of clusters increases, the WCSS decreases. The optimal k is the point where the curve bends and the rate of decrease flattens. Here, the optimal number of clusters is 6.
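As an aside, the elbow can also be located programmatically rather than by eye. The sketch below runs on synthetic data (not the BRICS dataset) and uses the largest second difference of the WCSS curve as a rough elbow estimate; this is only one heuristic among several.

```python
# Sketch: programmatic elbow detection on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
ks = range(1, 8)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

# The elbow is where the curve bends most sharply: the largest
# second difference of consecutive WCSS values.
second_diff = np.diff(wcss, 2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]
print(f"Elbow at k = {elbow_k}")
```

The second-difference trick is a blunt instrument; on noisy real data, inspecting the plot (as above) remains the safer call.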
Optimal K-Means : Silhouette Score Method
Code
# silhouette score to find optimal k clusters
range_clusters = range(2, 8)
silhouette_scores = []
for n in range_clusters:
    kmeans = KMeans(n_clusters=n, random_state=2339)
    kmeans.fit(x)
    cluster_labels = kmeans.labels_
    silhouette_scores.append(silhouette_score(x, cluster_labels))

# plotting
plt.plot(range_clusters, silhouette_scores, 'bx-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis for Optimal k')
plt.show()

optimal_clusters = range_clusters[np.argmax(silhouette_scores)]
print(f"Optimal Number of Clusters: {optimal_clusters}")
Optimal Number of Clusters: 6
Finding the optimal number of clusters can also be done with the silhouette score, which quantifies how similar a data point is to its own cluster ('cohesion') compared to the other clusters ('separation'). In the code above, for each k in the chosen range we compute the silhouette score and pick the k that maximises it. Since the score ranges between -1 and +1, the k with the higher score is chosen, as it indicates more distinct, well-defined clusters. Through this method as well, we see that the optimal k value is 6.
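To make the cohesion/separation definitions concrete, here is a small sketch on a toy one-dimensional dataset (not the economic data) that computes the silhouette score by hand and checks it against scikit-learn's `silhouette_score`.

```python
# Sketch: silhouette score computed by hand on a toy dataset.
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[1.0], [1.5], [8.0], [8.5]])
labels = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels):
    scores = []
    for i in range(len(X)):
        same = [j for j in range(len(X)) if labels[j] == labels[i] and j != i]
        other = [j for j in range(len(X)) if labels[j] != labels[i]]
        # a: mean distance to points in the same cluster (cohesion)
        a = np.mean([np.linalg.norm(X[i] - X[j]) for j in same])
        # b: mean distance to the other cluster (separation); with only
        # two clusters this is trivially the minimum over other clusters
        b = np.mean([np.linalg.norm(X[i] - X[j]) for j in other])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

print(silhouette_by_hand(X, labels))
print(silhouette_score(X, labels))
```

Both calls print the same value, close to +1 here because the two toy clusters are far apart relative to their internal spread.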
Optimal K-Value Clustering
Without Feature Extraction
Code
k = 6
kmeans = KMeans(n_clusters=k, n_init=10, random_state=2339)
optimal_kmeans = kmeans.fit_predict(x)
optimal_kmeans
One of the questions I wanted to ask was about the relationship between external debt shocks and GDP growth. These two have a complex relationship influenced by debt levels, composition, and terms, as well as external factors and policy responses. High and unsustainable debt can stifle economic progress, and the mix of concessional and commercial loans influences the outcome. Debt terms, foreign shocks, and a country's policy reaction are all important considerations, and global economic conditions and country-specific factors such as governance and political stability add further complexity. Hence, without imposing a mapping between dependent and independent variables, I wanted to find out how external debt shocks and GDP growth cluster together. The plot shows clustering in which certain clusters are exclusive, but most overlap and are non-exclusive in nature: two clusters are properly formed, whereas the other two are not.
Looking at the pairwise relationships of all the variables with one another also gives us an interesting plot.
With Feature Extraction
Feature extraction is an important component of well-defined, better-performing clusters, so an attempt is made here to see whether it actually helps.
Code
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# before pca
k_means_before = KMeans(n_clusters=6, random_state=42)
optimal_kmeans_before = k_means_before.fit_predict(x)

# after pca
pca = PCA(n_components=4)
optimal_pca_kmeans = pca.fit_transform(x)
kmeans_after_pca = KMeans(n_clusters=6, random_state=42)
labels_after_pca = kmeans_after_pca.fit_predict(optimal_pca_kmeans)

# after tsne
tsne = TSNE(n_components=3, perplexity=2, random_state=42)
optima_tsne_kmeans = tsne.fit_transform(x)
kmeans_after_tsne = KMeans(n_clusters=6, random_state=42)
labels_after_tsne = kmeans_after_tsne.fit_predict(optima_tsne_kmeans)

silhouette_score_before_pca = silhouette_score(x, optimal_kmeans_before)
print(f"Silhouette Score before PCA: {silhouette_score_before_pca:.4f}")
silhouette_score_after_pca = silhouette_score(optimal_pca_kmeans, labels_after_pca)
print(f"Silhouette Score after PCA: {silhouette_score_after_pca:.4f}")
silhouette_score_after_tsne = silhouette_score(optima_tsne_kmeans, labels_after_tsne)
print(f"Silhouette Score after TSNE: {silhouette_score_after_tsne:.4f}")
Silhouette Score before PCA: 0.4454
Silhouette Score after PCA: 0.4803
Silhouette Score after TSNE: 0.2795
Code
col = ('gdp_growth', 'ex_debt_shocks')
indices = [x.columns.get_loc(c) for c in col]
print(f"The indices of the columns {col} are: {indices}")
The indices of the columns ('gdp_growth', 'ex_debt_shocks') are: [8, 6]
I am finding the indices so that I can compare the exact values for clustering before and after PCA.
Code
evr = pca.explained_variance_ratio_
cev = np.cumsum(evr)
print("Explained Variance Ratio for Each Component:")
print(evr * 100)
# using PCA components 1 and 2 is better
Explained Variance Ratio for Each Component:
[37.66203489 22.03607405 14.10907775 9.93647105]
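These ratios can also drive the choice of component count automatically. A common heuristic, sketched below, keeps the smallest number of components whose cumulative explained variance crosses a threshold; the 80% threshold is an assumption for illustration, not something the analysis above used.

```python
# Sketch: choose the smallest PCA component count whose cumulative
# explained variance reaches an assumed 80% threshold.
import numpy as np

# Ratios taken from the output above (as fractions, not percentages).
evr = np.array([0.3766, 0.2204, 0.1411, 0.0994])
cumulative = np.cumsum(evr)

# First index where the cumulative share crosses the threshold.
# (Assumes the threshold is actually reachable with these components.)
n_components = int(np.argmax(cumulative >= 0.80)) + 1
print(n_components)  # → 4
```

With these four ratios the cumulative shares are roughly 0.38, 0.60, 0.74, and 0.84, so all four components are needed to pass 80%; a lower threshold such as 70% would keep only three.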
Here, you can see that using PCA as a feature extraction method has made the clusters more pronounced. You can also now see the association between GDP growth and external debt shocks.
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that lie in dense regions of the feature space. It takes two parameters: eps, the neighbourhood radius, and min_samples, the minimum number of points required within that radius for a point to count as a 'core' point. Clusters grow outwards from core points through density-connected neighbours, while points that belong to no dense region are labelled as noise (-1).

I am using DBSCAN as an alternative to k-means because, unlike k-means, it does not need the number of clusters specified in advance, it can find arbitrarily shaped clusters, and it is robust to outliers. The economic indicators here may not form the roughly spherical, equally sized clusters that k-means implicitly assumes, so a density-based view is a useful cross-check.
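A common heuristic for choosing eps, separate from the grid search used below, is the k-distance plot: sort every point's distance to its min_samples-th nearest neighbour and look for the knee. The sketch runs on synthetic data, not the BRICS dataset, and estimates the knee crudely as the largest jump between consecutive sorted distances.

```python
# Sketch: k-distance heuristic for picking DBSCAN's eps on synthetic data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

min_samples = 4
# n_neighbors includes the query point itself (distance 0) as neighbour 0.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)          # shape (n_samples, min_samples)
k_dist = np.sort(distances[:, -1])       # distance to the last neighbour, sorted

# Rough knee estimate: the largest jump between consecutive sorted
# distances. In practice, plotting k_dist and eyeballing the bend
# is the usual approach.
knee = int(np.argmax(np.diff(k_dist)))
eps_guess = float(k_dist[knee])
print(f"suggested eps ~ {eps_guess:.2f}")
```

The largest-jump rule is fragile on real data (a single outlier can dominate the tail), so treat the result as a starting point for the parameter sweep below rather than a final answer.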
Optimal Parameter Tuning
Code
for eps in [i / 10 for i in range(4, 14)]:
    for min_samples in range(4, 12):
        print("\neps={}".format(eps))
        print("min_samples={}".format(min_samples))
        # Apply DBSCAN
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(x)
        # Check if there is only one unique label
        if len(np.unique(labels)) == 1:
            print("Only one cluster found.")
        else:
            # Calculate Silhouette Score
            silh = silhouette_score(x, labels)
            # Print cluster information
            print("Clusters present: {}".format(np.unique(labels)))
            print("Cluster sizes: {}".format(np.bincount(labels + 1)))
            print("Silhouette Score: {}".format(silh * 100))
This is code I had worked on during a customer-segmentation project on a real-life KPMG dataset, which segments customers on the basis of their consumption patterns during an RFM analysis. Within a defined range of parameters (eps and minimum samples), it finds the values that give the best solution. Here we see that with an eps of 1.2 and a minimum of 4 samples, the silhouette score is around 53%.
Again, PCA1 and PCA2 attempt to capture the maximum variance of the dataset for comparison.
Code
# Visualization
plt.figure(figsize=(15, 5))

# Before PCA
plt.subplot(1, 3, 1)
plt.scatter(x['ex_debt_shocks'], x['gdp_growth'], c=labels_optimal_dbscan, cmap='viridis', edgecolor='k')
plt.title('DBSCAN Clustering before PCA')
plt.xlabel('ex_debt_shocks')
plt.ylabel('gdp_growth')

# After PCA
plt.subplot(1, 3, 2)
plt.scatter(optimal_pca_kmeans1[:, 0], optimal_pca_kmeans1[:, 1], c=labels_optimal_dbscan_pca, cmap='viridis', edgecolor='k')
plt.title('DBSCAN Clustering after PCA')

# After t-SNE
ax = plt.subplot(1, 3, 3)
ax.scatter(optima_tsne_kmeans2[:, 0], optima_tsne_kmeans2[:, 1], c=labels_optimal_dbscan_tsne, cmap='viridis', edgecolor='k')
ax.set_title('DBSCAN Clustering after t-SNE')
ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')

plt.tight_layout()
plt.show()
Although the silhouette score is higher for t-SNE, if you look at the plots, the clustering appears better for DBSCAN after PCA; the visual impression and the silhouette scores give different answers.
Hierarchical Clustering
Hierarchical clustering builds a tree of nested clusters rather than a single flat partition. The agglomerative (bottom-up) variant used here starts with every observation in its own cluster and repeatedly merges the two closest clusters, according to a linkage criterion such as Ward's method, until a single cluster remains. Unlike k-means it does not require the number of clusters up front, and unlike DBSCAN it does not depend on a density threshold; instead, the full merge history is recorded in a dendrogram, and a flat clustering is obtained by cutting the tree at a chosen height.
Finding optimal clusters
When you want to find the optimal number of clusters, you can inspect the dendrogram: cutting the tree at the height where the vertical merge distances jump most sharply gives a natural cluster count. The silhouette score offers a complementary, quantitative check.
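The dendrogram cut described above can also be done programmatically. The sketch below runs on synthetic data (not the economic dataset) and uses SciPy's `fcluster` to cut the Ward linkage tree at the largest jump in merge height.

```python
# Sketch: cut a Ward dendrogram at the tallest gap in merge heights.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=42)

Z = linkage(X, method='ward')

# Column 2 of the linkage matrix holds the merge distances, in
# increasing order. Cut between the two merges with the biggest gap.
heights = Z[:, 2]
i = int(np.argmax(np.diff(heights)))
threshold = (heights[i] + heights[i + 1]) / 2
labels = fcluster(Z, t=threshold, criterion='distance')
print(f"{labels.max()} clusters found")
```

Because the cut is chosen by the single largest gap, it tends to find the most obvious split; on data with clusters at several scales, inspecting the dendrogram by eye (as below) remains more informative.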
# Silhouette Score
max_clusters = 9
silhouette_scores = []
for n_clusters in range(2, max_clusters + 1):
    agglomerative = AgglomerativeClustering(n_clusters=n_clusters)
    labels = agglomerative.fit_predict(x)
    silhouette_scores.append(silhouette_score(x, labels))

# Plot the silhouette scores
plt.plot(range(2, max_clusters + 1), silhouette_scores, marker='o')
plt.title('Silhouette Score vs. Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.show()
We see that the optimal number of agglomerative clusters is around 6, where the silhouette score is around 0.44. Now we cluster both before and after the two feature extraction methods, PCA and t-SNE.
Code
# Before PCA: baseline labels from the tuned DBSCAN model
hierarchial_optimal = DBSCAN(eps=1.2, min_samples=4)
labels_optimal_hierarcial = hierarchial_optimal.fit_predict(x)

# After PCA
pca2 = PCA(n_components=4)
optimal_pca_kmeans2 = pca2.fit_transform(x)
hierarchal_optimal_after_pca = AgglomerativeClustering(n_clusters=6)
labels_optimal_hierarchal_pca = hierarchal_optimal_after_pca.fit_predict(optimal_pca_kmeans2)

# After t-SNE
tsne2 = TSNE(n_components=3, perplexity=2, random_state=42)
optima_tsne_hierarchial = tsne2.fit_transform(x)
hierarchal_optimal_after_tsne = AgglomerativeClustering(n_clusters=6)
labels_optimal_hierarchal_tsne = hierarchal_optimal_after_tsne.fit_predict(optima_tsne_hierarchial)
Code
# Silhouette scores
silhouette_score_before_pca1 = silhouette_score(x, labels_optimal_hierarcial)
print(f"Silhouette Score before PCA: {silhouette_score_before_pca1:.4f}")
silhouette_score_after_pca1 = silhouette_score(optimal_pca_kmeans2, labels_optimal_hierarchal_pca)
print(f"Silhouette Score after PCA: {silhouette_score_after_pca1:.4f}")
silhouette_score_after_tsne1 = silhouette_score(optima_tsne_hierarchial, labels_optimal_hierarchal_tsne)
print(f"Silhouette Score after TSNE: {silhouette_score_after_tsne1:.4f}")
Silhouette Score before PCA: 0.5285
Silhouette Score after PCA: 0.5121
Silhouette Score after TSNE: 0.5379
Here again, we see that the silhouette score after t-SNE is higher. We shall have to check this through visualization.
Code
# Visualize clusters before and after hierarchical clustering
plt.figure(figsize=(18, 5))

# Before Hierarchical Clustering
plt.subplot(1, 3, 1)
plt.scatter(x.values[:, 0], x.values[:, 1], c=labels_optimal_hierarcial, cmap='viridis', edgecolor='k')
plt.title('Clusters Before Hierarchical Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# After Hierarchical Clustering with PCA
plt.subplot(1, 3, 2)
plt.scatter(optimal_pca_kmeans2[:, 0], optimal_pca_kmeans2[:, 1], c=labels_optimal_hierarchal_pca, cmap='viridis', edgecolor='k')
plt.title('Clusters After Hierarchical Clustering (PCA)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

# After Hierarchical Clustering with t-SNE
plt.subplot(1, 3, 3)
plt.scatter(optima_tsne_hierarchial[:, 0], optima_tsne_hierarchial[:, 1], c=labels_optimal_hierarchal_tsne, cmap='viridis', edgecolor='k')
plt.title('Clusters After Hierarchical Clustering (t-SNE)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')

plt.tight_layout()
plt.show()
Looking at hierarchical clustering with the agglomerative technique, the clustering after PCA is better: the clusters are well defined and do not overlap one another, since the best feature set is used to compare the dataset.
Code
# Create subplots for before and after feature extraction
fig, axes = plt.subplots(1, 3, figsize=(12, 6))

# Dendrogram before PCA
dendrogram(linkage(x, method='ward'), ax=axes[0])
axes[0].set_title('Hierarchical Clustering Dendrogram (Before PCA)')
axes[0].set_xlabel('Data Points')
axes[0].set_ylabel('Distance')

# Dendrogram after PCA
dendrogram(linkage(optimal_pca_kmeans2, method='ward'), ax=axes[1])
axes[1].set_title('Hierarchical Clustering Dendrogram (After PCA)')
axes[1].set_xlabel('Data Points')
axes[1].set_ylabel('Distance')

# Dendrogram after t-SNE
dendrogram(linkage(optima_tsne_hierarchial, method='ward'), ax=axes[2])
axes[2].set_title('Hierarchical Clustering Dendrogram (After t-SNE)')
axes[2].set_xlabel('Data Points')
axes[2].set_ylabel('Distance')

plt.tight_layout()
plt.show()